Machine learning applied to hyperspectral images

Authors:

In this project, the goal is to develop a classification pipeline for hyperspectral images. Two hyperspectral images are provided, both acquired by the AVIRIS sensor, whose specifications are 224 spectral bands between 0.4μm and 2.5μm with a 10nm bandwidth.

Each image contains 16 different categories, and a classifier must be found that labels the pixels of the images. The classifier should be as accurate and as generic as possible. Deep learning is forbidden (for educational purposes).

Many experiments will be performed in order to find the best classifier, combining preprocessing, transformers, and a search over candidate classifiers.

Table of content:

Resources:

Import Python dependencies. Remember to run pip install -r requirements.txt before running the notebook

Load data

Table of content

Hyperspectral image

The two hyperspectral images look different:

Luckily, the data type is the same (float32).

Load labels

Each image has 17 categories. Category 0 is considered a non-category and will be called Other from now on. The Other category is also the dominant one, which means many pixels are actually not labelled.

Even though the two images have the same number of categories, the categories of the two images are not related.

Dictionary label to category

Categories from Indiana:

| # | Category name |
|---|-------|
| 0 | Other |
| 1 | Alfalfa |
| 2 | Corn-notill |
| 3 | Corn-mintill |
| 4 | Corn |
| 5 | Grass-pasture |
| 6 | Grass-trees |
| 7 | Grass-pasture-mowed |
| 8 | Hay-windrowed |
| 9 | Oats |
| 10 | Soybean-notill |
| 11 | Soybean-mintill |
| 12 | Soybean-clean |
| 13 | Wheat |
| 14 | Woods |
| 15 | Buildings-Grass-Trees-Drives |
| 16 | Stone-Steel-Towers |

Categories from Salinas:

| # | Category name |
|---|-------|
| 0 | Other |
| 1 | Brocoli_green_weeds_1 |
| 2 | Brocoli_green_weeds_2 |
| 3 | Fallow |
| 4 | Fallow_rough_plow |
| 5 | Fallow_smooth |
| 6 | Stubble |
| 7 | Celery |
| 8 | Grapes_untrained |
| 9 | Soil_vinyard_develop |
| 10 | Corn_senesced_green_weeds |
| 11 | Lettuce_romaine_4wk |
| 12 | Lettuce_romaine_5wk |
| 13 | Lettuce_romaine_6wk |
| 14 | Lettuce_romaine_7wk |
| 15 | Vinyard_untrained |
| 16 | Vinyard_vertical_trellis |

In both hyperspectral images, the categories are strongly imbalanced and the dominant category is the Other category.

Preprocessing

Table of content

Gaussian blur

Table of content

Let's first see the impact of a gaussian blur on RGB images.

The multi-dimensional filter is implemented as a sequence of one-dimensional convolution filters.

Why may a gaussian blur be useful as a preprocessing step? According to Pre-processing of hyperspectral images. Essential steps before image analysis: "De-noising: The instrumental noise can be partly removed by using smoothing techniques." The given hyperspectral images contain instrumental noise acquired during image acquisition. One way to reduce that noise is to apply a blur over the images.

A gaussian blur with a sigma of 5 will be applied to the input data. This value does not come out of nowhere: later in the notebook, it will be shown to be the optimal value.

Let's plot the noise of some bands and see how well a gaussian blur works on them.

As can be seen above, in blue there is a lot of noise (the visible spikes). The gaussian blur works well because those spikes are smoothed: in orange, the curves are much smoother and the noise has disappeared. By applying a gaussian blur, the hyperspectral images are less affected by the noise, which should make classification more accurate.
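To make the de-noising step concrete, here is a minimal sketch (on a synthetic cube, not the actual AVIRIS data) of spatially blurring a hyperspectral cube with scipy's gaussian_filter, smoothing only the two spatial axes:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic noisy "hyperspectral" cube: 64x64 pixels, 8 bands.
rng = np.random.default_rng(0)
cube = 1.0 + rng.normal(0.0, 0.5, size=(64, 64, 8)).astype(np.float32)

# Blur only the two spatial axes (sigma=5); leave the spectral axis untouched.
blurred = gaussian_filter(cube, sigma=(5, 5, 0))

# The blur averages out the zero-mean noise: per-band variance drops sharply.
print(cube[..., 0].std(), blurred[..., 0].std())
```

The sigma tuple follows the axis order of the array, so a band-wise spectrum keeps its shape while each band image is smoothed.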

Transformers

Table of content

Reshape inputs

Table of content

Standard scaler

Table of content

Standardize the features by subtracting the mean and dividing by the standard deviation, such that: $$ x_{new} = \frac{x - u}{s} $$

In machine learning, it is better to scale the data so that it looks like standard normally distributed data. According to scikit-learn: "For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order." Some algorithms assume the data are standardized, so this is a must-do step.
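A minimal sketch of this standardization with scikit-learn's StandardScaler (toy data, purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - mean) / std, per feature

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```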

Dimension reduction (PCA)

Table of content

Machine learning algorithms are more efficient in lower dimensions. The hyperspectral images have at least 200 dimensions, which seems way too large for many machine learning algorithms. Thus, we are going to reduce the dimensionality of the problem using the Principal Component Analysis (PCA) algorithm.

Documentation:

Setup PCA

Analyse PCA

The first principal components explain most of the variance. The last principal components are not needed: they usually hold the noise information. As we want to reduce the dimensionality, only the first principal components are going to be used to project the data into a new space.

Make decision

How many principal components are we going to use to project the data? We want to project the data so that at least a given percentage (threshold) of the variance (information) is preserved. This information can be read from the cumulative sum of the explained variance: we keep as many principal components as needed for their total explained variance to exceed the threshold.

Projection

The hyperspectral image is projected from 200 dimensions to the number of PC dimensions (5 if 95% is the threshold). Thus, the number of dimensions is strongly reduced.

Extra information

Mean Square Error (MSE)

The MSE is lower when more PCs are retained, i.e. the reconstruction error decreases with the number of PCs. However, with only a few PCs the error might already be low enough.
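This trend can be reproduced on synthetic data (a stand-in for the hyperspectral cube): the reconstruction MSE shrinks as more PCs are retained.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with a few dominant directions, standing in for spectra.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
X[:, :5] *= 8.0

# Reconstruction MSE for an increasing number of retained PCs.
for k in (2, 5, 10, 20):
    pca = PCA(n_components=k).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))
    print(k, np.mean((X - X_rec) ** 2))
```

With all 20 components the reconstruction is exact, so the MSE falls to (numerically) zero.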

Thresholds

Function to get the number of principal components to use according to a threshold on the explained variance. This function is needed during a grid search (used later in the notebook).
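Such a helper could look like the following sketch (the function name and signature are illustrative, not the notebook's actual implementation):

```python
import numpy as np

def n_components_for_threshold(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading PCs whose cumulative explained
    variance reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # First index where the cumulative variance reaches the threshold.
    return int(np.searchsorted(cumulative, threshold)) + 1

# The first two PCs already explain 96% of the variance here.
print(n_components_for_threshold([0.80, 0.16, 0.03, 0.01]))  # 2
```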

Select K best

Table of content

Select the k best features. It might be useful to keep track of the overall most representative features. A score is computed for every feature: the features with the highest scores hold more information and are thus more relevant for the classification.
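A minimal sketch of this feature scoring with scikit-learn's SelectKBest and the ANOVA F-score (synthetic data; the score function used in the notebook may differ):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

print(X_new.shape)       # (200, 5): only the 5 best-scoring features remain
print(selector.scores_)  # one ANOVA F-score per feature
```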

In the case of the Indiana hyperspectral image, it can be seen that the features have different scores, and thus different impacts on the classification. For instance, the features between 50 and 100 are poorly relevant compared to the features between 100 and 200. However, the top scores are all close to each other. It does not seem relevant to use this feature extractor: we would select some features but miss a lot of information by discarding the others.

Classification API

Table of content

In this section, a few functions are implemented for convenience, factorization and cleanliness of the code. Moreover, the usage of a sklearn classifier is the same no matter which classifier is used.

Grid search with cross-validation

Table of content

A classifier may have several hyperparameters. The goal is to find the best classifier with the best hyperparameters for a given problem. The grid search performs an exhaustive search over the specified hyperparameters.

Workflow:

  1. Create classifier
  2. Create the list of hyperparameters
  3. Search for the best classifier (fit step). Try every combination of hyperparameters. For each combination, split the training data into 80% training data and 20% validation data, fit on the training part and score on the validation part. Repeat this 5 times per combination, so that every sample has been used once as validation data. Compute the mean score of each combination and return the combination with the best score.
  4. Compute the score of the classifier with the best hyperparameters over the test data.
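The workflow above can be sketched with scikit-learn's GridSearchCV (iris data and a hypothetical parameter grid stand in for the hyperspectral setup):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data for the final test score (step 4).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 1-2: classifier and a hypothetical grid of hyperparameters.
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Step 3: exhaustive search with 5-fold cross-validation on the training data.
search = GridSearchCV(LinearSVC(dual=False), param_grid, cv=5)
search.fit(X_train, y_train)

# Step 4: score the best combination on the held-out test data.
print(search.best_params_, search.score(X_test, y_test))
```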

documentation:

In the above function, the grid search is performed through a classifier. Scikit-learn provides a special kind of estimator called a Pipeline (scikit-learn). A pipeline consists of a sequence of transforms with a final estimator. It behaves like a classifier, but it also applies the transformations to the input data beforehand.

For instance, here is the workflow of a pipeline with a PCA (transform) then the classifier LinearSVC.

  1. Training
    1. pca.fit(X_train) --> fit the transform
    2. Xproj = pca.transform(X_train) --> transform the training input
    3. LinearSVC.fit(Xproj, y_train) --> fit the final estimator
  2. Test/Validation
    1. Xproj_test = pca.transform(X_test) --> transform the test input
    2. LinearSVC.predict(Xproj_test) --> predict the test labels

The pipeline is very useful for not forgetting any step of the transformation sequence. It is also very convenient to use.
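A minimal sketch of such a pipeline (PCA followed by LinearSVC, on iris data rather than the hyperspectral cube):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# fit(): pca.fit + pca.transform on X, then LinearSVC.fit on the projection.
# predict(): pca.transform on new data, then LinearSVC.predict.
pipe = Pipeline([("pca", PCA(n_components=2)),
                 ("clf", LinearSVC(dual=False))])
pipe.fit(X, y)
print(pipe.score(X, y))
```

Because the pipeline is itself an estimator, it can be dropped directly into GridSearchCV.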

Evaluation

Table of content

Confusion matrix

A confusion matrix is used to evaluate the quality of the predictions of a classifier. The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix, the better, as they indicate many correct predictions.

The true labels are represented by the rows and the predictions by the columns.
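A tiny example with scikit-learn's confusion_matrix (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true labels, columns are predicted labels; the diagonal
# counts the correct predictions.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```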

Documentation:

Scores

The accuracy is computed this way: $$Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $$

The precision for a class is the number of correctly predicted images of this class out of all predicted images of this class.

The recall for a class is the number of correctly predicted images of this class out of the number of actual images of this class.

The f1-score is the combination of the precision and recall scores such that $$\text{f1 score} = 2\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$ It is a compromise between precision and recall.

The f1-score can be averaged in two different ways, giving more options for interpreting the results:

The closer these scores get to 1, the more accurate the classifier is.
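A small sketch of these metrics with scikit-learn (toy labels; note how macro averaging weighs the small classes as much as the large one):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2]  # class 0 is much bigger than the others
y_pred = [0, 0, 0, 1, 1, 1, 2]

print(accuracy_score(y_true, y_pred))             # 6 of 7 predictions correct
# macro: unweighted mean of the per-class f1 scores
print(f1_score(y_true, y_pred, average="macro"))
# weighted: per-class f1 scores weighted by class support
print(f1_score(y_true, y_pred, average="weighted"))
```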

Multiclass classification

Table of content

How are we going to solve the given problem? Let's take a look at the scikit-learn algorithm cheat-sheet. It shows the steps to be aware of when choosing an algorithm to solve a machine learning problem. This way, a lot of missteps can be avoided.


  1. Begin from START
  2. Are there more than 50 samples? Answer: yes, so it is not required to get more data.
  3. Do we want to predict a category? Answer: yes, the category of each pixel must be predicted.
  4. Are the data labeled? Answer: yes, the two images come with their respective labels.
  5. It is a classification problem.

In this section, the best class predictor is going to be searched for. In the first part, we will see how to improve the dataset by removing a category. Then, we are going to search for the best classifier for the Indiana hyperspectral image and the Salinas hyperspectral image respectively.

With Other vs. without Other

Table of content

In this section, the impact of the Other category is checked. The data are strongly imbalanced, especially because of this category. As a reminder, this category means a lack of labelling. We want to prevent it from degrading the accuracy as much as possible.

The check will be performed as follows:

  1. Test the classifiers' accuracy including the Other category
  2. Test the classifiers' accuracy excluding the Other category

The problem of the Other category can be found in every input (Indiana and Salinas). The verification is only performed on the Indiana image: the result of this experiment would be the same with any other image, so it is not required to test all the given images.

Separate the Indiana data into two datasets. The first one includes the pixels labelled as Other; the second one excludes them.
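This separation boils down to a boolean mask on the labels, e.g. (toy arrays, assuming label 0 encodes Other as in the category tables):

```python
import numpy as np

# Hypothetical flattened data: one spectrum per pixel, label 0 = Other.
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 3, 0, 1, 2, 0])

# First dataset: everything. Second dataset: Other pixels masked out.
mask = y != 0
X_no_other, y_no_other = X[mask], y[mask]

print(X.shape, X_no_other.shape)  # (6, 2) (3, 2)
```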

LinearSVC

Let's proceed to the experiment using the LinearSVC classifier.

Find best hyperparameters

To find the best hyperparameters for our model, we use the Grid Search with Cross-Validation. It evaluates every combination of the given parameters. In addition, it performs cross-validation, which means the training data is repeatedly split into training and validation parts, varying which samples fall into each split. This makes the evaluation more reliable.

In the Grid Search, the parameters that vary are the following:

Classification report

The accuracy of the classifier does not look so low (0.86). However, as can be seen from the confusion matrix and the f1 scores, none of the categories except the Other category are predicted at all by the LinearSVC classifier. The f1 score of each category is computed, then the macro-average f1 score, which is the average of the per-category f1 scores over the number of categories. Its value is 0.79. The confusion matrix shows that the classifier only predicts some pixels as belonging to the Other category (and not even all of them).

Without the Other category, the classifier's accuracy (0.96) is greater than that of the same classifier with the Other category. Another difference is the macro-average f1 score: its value is now 0.97, which is 20 percent greater than the previous one. It means the classifier is better able to predict every class. It seems worth removing the Other category in order to predict all the remaining categories much better. Moreover, the Other category means the pixels were not labelled, so it makes sense to remove them as they can be considered noise.

On the other hand, the LinearSVC classifier has a rather low accuracy in both cases. Thus, let's double-check with another classifier.

RandomForest (Inherently multiclass)

Table of content

Let's proceed to the test using the RandomForest classifier.

Find best hyperparameters

In the Grid Search, the parameters that vary are the following:

Classification report

This time it is not as obvious as with the LinearSVC classifier. The accuracy and the f1 weighted average score are equal to each other, but the f1 macro average score is greater when the Other category is removed, as can be seen in the confusion matrix. Once more, without the Other category the classifier is better able to predict every remaining category.

Moreover, some small categories are shadowed by the Other category. As this category is very large, the classifiers tend to predict pixels belonging to the remaining categories as belonging to the Other category. This is not a wanted behavior at all: the classifier must be more sensitive to the small categories.

According to this experiment and its results, it makes sense to drop the Other category from the classification. This category is actually a non-category used for unlabelled pixels, and the performance of the classifiers is much worse because of it. It is decided to remove the pixels belonging to Other from the classification; the performance of the classifiers will improve.

Indiana classification (bench)

Table of content


The focus will be on the multiclass classifiers.

In this section, many experiments will be performed to search for the best classifier for the 16 categories of the Indiana hyperspectral image. The problem is a multiclass classification, so the multiclass classifiers from sklearn will be used. There are three ways of handling a multiclass problem:

  1. One Versus One (OvO): Split the multiclass dataset into multiple binary classification problems. Separate two classes at a time, ignoring the other data. Assign new points with a majority voting rule. Number of predictors: $N * (N - 1) / 2$.
  2. One Versus Rest/All (OvR): As with OvO, split the multiclass dataset into multiple binary classification problems, but separate each class from all the other classes. A binary classifier is then trained on each binary problem. Predictions are made using the model that is the most confident. Number of predictors: $N$.
  3. Inherently multiclass: Unlike the first two kinds of classifiers, this one handles multiple classes directly. No extra strategy is needed; it straightforwardly predicts the class of an input.

The output of the three approaches is the same: predicting the category of the input.
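The OvO and OvR strategies can be made explicit with scikit-learn's meta-estimators (iris data; an inherently multiclass classifier such as RandomForestClassifier needs no wrapper):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # N = 3 classes

ovo = OneVsOneClassifier(LinearSVC(dual=False)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(dual=False)).fit(X, y)

print(len(ovo.estimators_))  # N * (N - 1) / 2 = 3 binary classifiers
print(len(ovr.estimators_))  # N = 3 binary classifiers
```

With 16 categories, OvO would train 120 binary classifiers against 16 for OvR, which matters for training time.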

LinearSVC (One Versus All)

Table of content

Interpretation

With the LinearSVC (One Versus All), the results are quite satisfying because the mean accuracy, F1 macro score, and F1 weighted score are all above 0.96. However, the prediction of Stone-Steel-Towers is a bit disappointing since it has the lowest accuracy: the model misclassified 3 Soybean-clean pixels as Stone-Steel-Towers.

RandomForest (Inherently multiclass)

Table of content

Interpretation

With the Random Forest algorithm, the results are extremely satisfying. Indeed, the mean accuracy, F1 macro score and F1 weighted score are in the vicinity of 1. Only 2 mistakes were made by the model:

K-nearest neighbors (Inherently multiclass)

Table of content

In the Grid Search, the parameters that vary are the following:

Interpretation

With the K nearest neighbors algorithm, the results are very satisfying: the mean accuracy, F1 macro score and F1 weighted score are in the vicinity of 1, and only a few mistakes were made by the model.

SVC (One Versus One)

Table of content

In the Grid Search, the parameters that vary are the following:

Interpretation

With the SVC, the results are the best ones so far. Only 1 mistake was made by the model:

This mistake is somewhat recurrent because it also appeared in the results of the LinearSVC. It is an almost perfect result.

Recap of all classifiers

Table of content

Here, we will do a recap of the different classifiers used and show a benchmark.

Let's quickly add a constraint of time.

As we can see in the table and graph above, the scores of the Linear SVC are below the rest. The random forest and the SVC are the best classifiers. The K-neighbors classifier looks very efficient but is not as accurate as the top 2 classifiers in terms of accuracy and f1 score.

Moreover, in the case of a time constraint, the random forest classifier is the best compromise between efficiency, accuracy and speed.

But which classifier should be chosen between random forest and SVC? The problem at hand is to find the most accurate and most generic classifier for hyperspectral pixel classification. The most accurate classifiers have already been found, but are they generic? Luckily, we have a second hyperspectral image, Salinas, on which to test the genericity of the two classifiers. The best classifier will be selected according to its performance on this other hyperspectral image.

Salinas classification

Table of content

In this section, we will use the algorithms that performed best on Indiana, SVC and Random Forest, on Salinas. We are going to apply the previously found best hyperparameters directly to the Salinas data. In case of poor results, we will perform another Grid Search on the Salinas data with the different algorithms in order to find the optimal parameters for this image.

As before, the Other category needs to be removed before running the SVC algorithm on the Salinas data.

SVC (One Versus One)

Table of content

Interpretation

Using the SVC algorithm, the results can already be considered very promising, given that there are 5 times more pixels in the Salinas data. There are only a few mispredictions, and most of them come from confusion between Vinyard_untrained and Grapes_untrained, which is an understandable mistake since the two classes are similar.

Random Forest

Table of content

Interpretation

Using the Random Forest algorithm, this is the best result we have had: there is no mistake. The classifier is no longer confused between Vinyard_untrained and Grapes_untrained. The random forest classifier predicts every class equally well even though the class_weight parameter is not used.

Recap of salinas classifiers

Table of content

We will compare the different classifiers used for the Salinas image.

The Random forest seems to be the most trustworthy classifier for the Salinas image. As both scores are really close to 1 there is not much difference, but we can see a slightly better score for the Random forest algorithm: it made no mistakes on the test data, whereas the SVC made a few mispredictions.

Impact of the gaussian blur

Table of content

In this section, the benefit of a gaussian blur as a preprocessing step is going to be proven. The accuracy of the same classifier will be evaluated on input hyperspectral images with varying amounts of blur.

Applying a gaussian blur to the input hyperspectral image significantly improves the scores of every classifier.

Overall, a gaussian blur with sigma equal to 5 seems to be the best parameter, which is why it has been used throughout the whole notebook. It gives the best preprocessing for hyperspectral image classification.

Performance of the best classifier

Table of content

Let's make a recap of all the experiments that have been made so far. On the first experimentation with the Indiana hyperspectral image, two very accurate classifiers were found:

However, these classifiers were so close in terms of accuracy that it was impossible to pick one over the other. Moreover, the goal is to find an accurate and generic classifier. Thus, a second experiment with these two classifiers was performed, this time with the Salinas image. Through this experiment, we found out that the random forest algorithm works slightly better.

Our conclusion is that the best classifier is the random forest classifier because it is very accurate and the most generic.

For the Indiana prediction, there were 2 errors. For the Salinas prediction, there was no error. The random forest classifier is very accurate and very generic: it works very well on the two different hyperspectral datasets.

As a reminder, if we had kept the Other category, the score would still be high, but not as high as now.

Moreover, the classifiers were fitted on only part of the pixels (~85%) of the two inputs. The important point is that the classifiers still predict the remaining unseen pixels (~15%) accurately.

Tasks distribution

Table of content

Louis Ilan Raphael
Preprocessing X
Bench (SVC) X X
Bench (KNN) X
Bench (Linear SVC) X X
Bench (Random forest) X X
PCA X X X
Explanations in notebook X X
Classification API X
Slides for presentation X X X